    Simulating High-Dimensional Multivariate Data using the bigsimr R Package

    It is critical to accurately simulate data when employing Monte Carlo techniques to evaluate statistical methodology. In this era of big data, measurements are often correlated and high dimensional, such as data obtained in high-throughput biomedical experiments. Due to the computational complexity and the lack of user-friendly software for simulating these massive multivariate constructions, researchers resort to simulation designs that posit independence or perform arbitrary data transformations. To close this gap, we developed the Bigsimr Julia package with R and Python interfaces; this paper focuses on the R interface. These packages support high-dimensional random vector simulation with arbitrary marginal distributions and dependency specified via a Pearson, Spearman, or Kendall correlation matrix. bigsimr provides high-performance features, including multi-core and graphical-processing-unit-accelerated algorithms to estimate correlation and compute the nearest correlation matrix. Monte Carlo studies quantify the accuracy and scalability of our approach up to dimension d = 10,000. We describe example workflows and apply the package to a high-dimensional data set -- RNA-sequencing data obtained from breast cancer tumor samples.
    Comment: 22 pages, 10 figures, https://cran.r-project.org/web/packages/bigsimr/index.htm
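The simulation strategy described above -- arbitrary margins tied together through a specified correlation structure -- is in the Gaussian-copula (NORTA) family. The sketch below illustrates that general construction in Python; it is not the bigsimr API, and the function name and marginal choices are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def norta_sample(n, R, marginal_ppfs, seed=0):
    """Draw n samples of a random vector with dependence induced by the
    latent-normal correlation matrix R and the given marginal quantile
    functions (the NORTA / Gaussian-copula construction)."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(R)                        # R must be positive definite
    z = rng.standard_normal((n, R.shape[0])) @ L.T   # correlated standard normals
    u = stats.norm.cdf(z)                            # map to uniforms (copula step)
    return np.column_stack([ppf(u[:, j]) for j, ppf in enumerate(marginal_ppfs)])

# Example: a 3-dimensional vector with gamma, normal, and Poisson margins.
R = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, 0.4],
              [0.3, 0.4, 1.0]])
ppfs = [lambda u: stats.gamma.ppf(u, a=2.0),
        lambda u: stats.norm.ppf(u, loc=5.0, scale=2.0),
        lambda u: stats.poisson.ppf(u, mu=3.0)]
x = norta_sample(10_000, R, ppfs)
```

Note that the realized Pearson correlations of the output differ slightly from the latent-scale R after the marginal transforms; matching a target Pearson/Spearman/Kendall matrix exactly is precisely the harder problem the package addresses.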

    Efficient Parallel Statistical Model Checking of Biochemical Networks

    We consider the problem of verifying stochastic models of biochemical networks against behavioral properties expressed in temporal logic. Exact probabilistic verification approaches, such as CSL/PCTL model checking, are undermined by a huge computational demand that rules them out for most real case studies. Less demanding approaches, such as statistical model checking, estimate the likelihood that a property is satisfied by sampling executions of the stochastic model. We propose a methodology for efficiently estimating the likelihood that an LTL property P holds of a stochastic model of a biochemical network. As with other statistical verification techniques, the proposed methodology uses a stochastic simulation algorithm to generate execution samples; however, three key aspects improve its efficiency. First, sample generation is driven by on-the-fly verification of P, which results in optimal overall simulation time. Second, the confidence interval estimation for the probability that P holds is based on an efficient variant of the Wilson method, which ensures faster convergence. Third, the whole methodology is designed in a parallel fashion, and a prototype software tool has been implemented that performs the sampling/verification process in parallel on an HPC architecture.
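The core statistical loop -- sample executions, check the property on each, and report a Wilson score interval for the satisfaction probability -- can be sketched as follows. This is a toy illustration of the standard Wilson interval, not the paper's efficient variant or its tool; the simulated "property check" is a stand-in.

```python
import math, random

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n."""
    p_hat = k / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return center - half, center + half

def smc_estimate(sample_and_check, n, z=1.96):
    """Statistical model checking skeleton: draw n execution samples,
    verify the property on each, return the estimate and a Wilson CI."""
    k = sum(sample_and_check() for _ in range(n))
    lo, hi = wilson_interval(k, n, z)
    return k / n, (lo, hi)

# Stand-in for 'simulate a trace and check the LTL property on the fly':
random.seed(1)
est, (lo, hi) = smc_estimate(lambda: random.random() < 0.7, n=2000)
```

In a real tool, `sample_and_check` would run a stochastic simulation algorithm and monitor the LTL property during the run, stopping the trace as soon as the property is decided.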

    Stimulus-Response Analysis for Data in the Form of Proportions

    No full text
    INTRODUCTION Dichotomous response models are common in many engineering settings, and they are an important endpoint in quality control and quality testing. Often they represent the response of some experimental unit to an environmental or chemical stimulus, or of the unit over time, etc. Independent observations on each unit produce a value in the set {0, 1} with some probability of binary response, p. A common design involves T populations, treatment groups, dose levels, etc. When some score or other quantification of the stimulus, x_i (i = 1, ..., T), has been recorded along with the observations, an important issue for statistical study is the characterization of the stimulus-response relationship for use in prediction or assessment of the underlying phenomenon. Statistically, the recorded observations at the i-th treatment level are taken as the number of "positive" outcomes, Y_i, among the n_i experimental units examined
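A standard way to characterize such a stimulus-response relationship for binomial counts Y_i out of n_i at scores x_i is a logistic model p_i = expit(b0 + b1 x_i) fit by maximum likelihood. The sketch below is one common approach, with made-up toy data; the abstract itself does not commit to a specific link function.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def fit_logistic_dose_response(x, y, n):
    """Fit p_i = expit(b0 + b1 * x_i) to counts y_i out of n_i
    by maximizing the binomial log-likelihood."""
    x, y, n = map(np.asarray, (x, y, n))
    def nll(beta):
        eta = beta[0] + beta[1] * x
        # negative binomial log-likelihood, up to an additive constant
        return -np.sum(y * eta - n * np.log1p(np.exp(eta)))
    res = minimize(nll, x0=np.zeros(2), method="BFGS")
    return res.x

# Toy data: response frequency rises with the stimulus score x_i.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
n = np.array([50, 50, 50, 50, 50])
y = np.array([3, 10, 24, 38, 47])
b0, b1 = fit_logistic_dose_response(x, y, n)
```

The fitted slope b1 quantifies the strength of the stimulus effect on the log-odds scale, and the fitted curve can then be inverted for tasks such as effective-dose estimation.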

    Tables Of P-Values For t- And Chi-Square Reference Distributions

    No full text
    INTRODUCTION An important area of statistical practice involves the determination of P-values when performing significance testing. If the null reference distribution is standard normal, then many standard statistical texts provide a table of probabilities that may be used to determine the P-value; examples include Casella and Berger (1990), Hogg and Tanis (1997), Iman (1994), Moore and McCabe (1993), Neter et al. (1996), Snedecor and Cochran (1980), Sokal and Rohlf (1995), and Steel and Torrie (1980), among many others. If the null reference distribution is slightly more complex, however, such as a t-distribution or a χ²-distribution, most standard textbooks give only upper-α critical points rather than actual P-values. With the advent of modern statistical computing power, this is not a major concern; most statistical computing packages can output P-values associated with ...
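As the introduction notes, modern software makes these tables largely unnecessary: exact P-values for t and χ² reference distributions are one-liners via survival functions. A small illustration with arbitrary example statistics:

```python
from scipy import stats

# Two-sided P-value for a t statistic with df degrees of freedom:
t_stat, df = 2.31, 14
p_t = 2 * stats.t.sf(abs(t_stat), df)

# Upper-tail P-value for a chi-square statistic with k degrees of freedom:
chi2_stat, k = 11.8, 5
p_chi2 = stats.chi2.sf(chi2_stat, k)
```

Using the survival function `sf` (rather than `1 - cdf`) avoids loss of precision for small upper-tail probabilities.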

    On confidence bands and set estimators for the simple linear model

    No full text
    This paper reviews the duality between confidence bands and (convex) set estimators in simple linear regression. Applications of this duality are explored, including the nature of polygonal sets and the development of an algorithm that approximates the coverage probability of smooth confidence band functions.
    Keywords: simultaneous inference; linear regression; linear segment confidence bands; coverage probability approximation
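For concreteness, a classical example of a smooth simultaneous confidence band for the simple linear model is the Working-Hotelling band, sketched below; this is standard textbook material, not the paper's algorithm or its coverage approximation.

```python
import numpy as np
from scipy import stats

def working_hotelling_band(x, y, x_grid, alpha=0.05):
    """Working-Hotelling simultaneous confidence band for the regression
    line y = b0 + b1 * x, evaluated at the points in x_grid."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar = x.mean()
    Sxx = np.sum((x - xbar) ** 2)
    b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx     # least-squares slope
    b0 = y.mean() - b1 * xbar                          # least-squares intercept
    s2 = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)    # residual variance
    W = np.sqrt(2 * stats.f.ppf(1 - alpha, 2, n - 2))  # simultaneous multiplier
    se = np.sqrt(s2 * (1 / n + (x_grid - xbar) ** 2 / Sxx))
    fit = b0 + b1 * x_grid
    return fit - W * se, fit + W * se

# Example with simulated data from a true line y = 1 + 0.5 x.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 30)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, 30)
grid = np.linspace(0.0, 10.0, 11)
band_lo, band_hi = working_hotelling_band(x, y, grid)
```

The band is narrowest near the mean of the design points and widens toward the extremes, which is the hyperbolic shape whose set-estimator dual the paper discusses.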